In [1]:
#!conda install requests
#!conda install beautifulsoup4
#!conda install selenium

In [2]:
import requests
from bs4 import BeautifulSoup
import selenium.webdriver
import pandas as pd

Acknowledgements: The code below is very much inspired by Chris Bail's "Screen-Scraping in R". Thanks Chris!

Collecting Digital Trace Data: Web Scraping

Web scraping (also sometimes called "screen-scraping") is a method for extracting data from the web. There are many techniques for doing it, ranging from manual "human copy-paste" to fully automated systems. For research questions where you need to visit many webpages and collect essentially the same information from each, web scraping can be a great tool.

The typical web scraping program:

  1. Loads the address of a webpage to be scraped from your list of webpages
  2. Downloads the HTML or XML of that website
  3. Extracts any desired information
  4. Saves that information in a convenient format (e.g. CSV, JSON, etc.)

From Chris Bail's "Screen-Scraping in R": https://cbail.github.io/SICSS_Screenscraping_in_R.html
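
As a concrete (if minimal) sketch of those four steps, the cell below loops over a hypothetical list of URLs, downloads each page, extracts every level-one heading, and appends the results to a CSV file. The URLs, the output filename, and the choice of h1 tags are all placeholders rather than a real scraping target.

In [ ]:
import csv
import time

import requests
from bs4 import BeautifulSoup

urls = ["https://example.com/page1", "https://example.com/page2"]  # 1. list of webpages to scrape (placeholders)

with open("scraped_headings.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["url", "heading"])
    for url in urls:
        html = requests.get(url).text                        # 2. download the HTML of the page
        soup = BeautifulSoup(html, "html.parser")
        for h1 in soup.find_all("h1"):                       # 3. extract the desired information
            writer.writerow([url, h1.get_text(strip=True)])  # 4. save it in a convenient format (CSV)
        time.sleep(1)  # pause between requests to be polite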

Legality & Politeness

When the internet was young, web scraping was a common and legally acceptable practice for collecting data on the web. But with the rise of online platforms, some of which rely heavily on user-created content (e.g. Craigslist), the data on these sites has come to be recognized by the companies behind them as highly valuable. Furthermore, from a website developer's perspective, web crawlers can request many pages from a site in rapid succession, increasing server load and generally being a nuisance.

Thus many websites, especially large sites (e.g. Yelp, AllRecipes, Instagram, The New York Times, etc.), have now forbidden "crawlers" / "robots" / "spiders" from harvesting their data in their "Terms of Service" (TOS). From Yelp's Terms of Service:

Before embarking on a research project that will involve web scraping, it is important to understand the TOS of the site you plan on collecting data from.

If the site does allow web scraping (and you've consulted your legal professional), check whether it has a robots.txt file. This file tells search engines and web scrapers (including ones written by researchers like you) how to interact with the site "politely": how frequently requests may be made, which pages to avoid, and so on.
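
For example, Python's built-in urllib.robotparser can read a site's robots.txt and tell you whether a given path may be fetched. A minimal sketch, using the Humane Society site we scrape below (whether and how you may use its data is still something to check in the TOS):

In [ ]:
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.boulderhumane.org/robots.txt")
rp.read()

# May a generic crawler ("*") fetch the dogs listing?
print(rp.can_fetch("*", "https://www.boulderhumane.org/animals/adoption/dogs"))
# Some sites also declare a crawl delay (None if unspecified).
print(rp.crawl_delay("*"))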

Requesting a Webpage in Python

When you visit a webpage, your web browser renders an HTML document with CSS and Javascript to produce a visually appealing page. For example, to us, the Boulder Humane Society's listing of dogs available for adoption looks something like what's displayed at the top of the browser below:

But to your web browser, the page actually looks like the HTML source code (basically, instructions for what text and images to show and how to show them) seen at the bottom of the page. To view the source code of a webpage, in Safari go to Develop > Show Page Source, or in Chrome go to View > Developer > View Source.

To request the HTML for a page in Python, you can use the Python package requests, as such:


In [3]:
pet_pages = ["https://www.boulderhumane.org/animals/adoption/dogs", 
             "https://www.boulderhumane.org/animals/adoption/cats", 
             "https://www.boulderhumane.org/animals/adoption/adopt_other"]

r = requests.get(pet_pages[0])
html = r.text
print(html[:500]) # Print the first 500 characters of the HTML. Notice how it's the same as the screenshot above.


<!DOCTYPE html>
<head>
<meta http-equiv="X-UA-Compatible" content="IE=Edge" />
<meta charset="utf-8" />
<link rel="shortcut icon" href="https://www.boulderhumane.org/sites/default/files/favicon.ico" type="image/vnd.microsoft.icon" />
<meta name="Generator" content="Drupal 7 (http://drupal.org)" />
<meta name="viewport" content="width=1000px, initial-scale=1.0, maximum-scale=1.0" />
<title>Dogs Available for Adoption | Humane Society of Boulder Valley</title>
<link type="text/css" rel="stylesheet
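
Before parsing, it's also worth checking that the request actually succeeded and, when requesting several pages, pausing between requests so you don't overload the server. A minimal sketch (the one-second delay is an arbitrary choice, not a requirement of this site):

In [ ]:
import time

for url in pet_pages:
    r = requests.get(url)
    if r.status_code != 200:  # anything other than 200 means the request failed
        print("Request failed:", url, r.status_code)
        continue
    html = r.text
    # ... parse the HTML here ...
    time.sleep(1)  # be polite: wait a second between requests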

Parsing HTML with BeautifulSoup

Now that we've downloaded the HTML of the page, we next need to parse it. Let's say we want to extract all of the names, ages, and breeds of the dogs, cats, and small animals currently up for adoption at the Boulder Humane Society.

Navigating to the location of those attributes in the page can be somewhat tricky. Luckily, HTML has a tree structure, as shown below, in which tags nest inside other tags. For example, the title of the document sits inside its head, which in turn sits inside the larger html document (<html> <head> <title> </title> ... </head> ... </html>).

From Chris Bail's "Screen-Scraping in R": https://cbail.github.io/SICSS_Screenscraping_in_R.html

To find the first pet on the page, we'll find that HTML element's "CSS selector". Within Safari, hover your mouse over the image of the first pet, control+click on the image, and choose 'Inspect Element'. This should highlight the section of HTML where the object you are trying to parse lives. Sometimes you may need to move your mouse through the HTML to find the exact location of the object (see GIF).

(You can also go to 'Develop > Show Page Source' and then click 'Elements'. Hover your mouse over sections of the HTML until the object you are trying to find is highlighted within your browser.)

BeautifulSoup is a Python library for parsing HTML. We'll pass the CSS selector that we just copied to BeautifulSoup's select method to grab that object. Notice below how select-ing that pet shows us all of its attributes.


In [5]:
soup = BeautifulSoup(html, 'html.parser')
pet = soup.select("#block-system-main > div > div > div.view-content > div.views-row.views-row-1.views-row-odd.views-row-first.On.Hold")
print(pet)


[<div class="views-row views-row-1 views-row-odd views-row-first On Hold">
<div class="views-field views-field-field-pp-photo"> <div class="field-content animal-photo"><a href="/animals/adoption/39180261"><img border="0" height="162" src="https://g.petango.com/photos/993/c5543f18-5a65-4703-aab3-c7eb08ea3108.jpg" title="Adopt Me" width="216"/></a></div> </div>
<div class="views-field views-field-field-pp-splashtitle"> <div class="field-content">On Hold</div> </div>
<div class="views-field views-field-field-pp-animalname"> <div class="field-content"><a href="/animals/adoption/39180261" title="Adopt Me!">Bear</a></div> </div>
<div class="views-field views-field-field-pp-primarybreed"> <div class="field-content">Pointer</div> </div>
<div class="views-field views-field-field-pp-secondarybreed"> <div class="field-content">Mix</div> </div>
<div class="views-field views-field-field-pp-age"> <span class="views-label views-label-field-pp-age">Age: </span> <span class="field-content">0 years 2 months</span> </div>
<div class="views-field views-field-field-pp-gender"> <span class="views-label views-label-field-pp-gender">Sex: </span> <span class="field-content">Female</span> </div>
<div class="views-field views-field-field-pp-animalid-1"> <span class="views-label views-label-field-pp-animalid-1">ID: </span> <span class="field-content">39180261</span> </div>
<div class="views-field views-field-edit-node"> <span class="field-content"></span> </div> </div>]

Furthermore, we can select the name, breeds, and age of the pet (its gender and ID are there too) by find-ing the div tags that contain this information. Notice how each div tag has a class attribute (passed via attrs), e.g. "views-field views-field-field-pp-animalname" for the name.


In [6]:
name = pet[0].find('div', attrs = {'class': 'views-field views-field-field-pp-animalname'})
primary_breed = pet[0].find('div', attrs = {'class': 'views-field views-field-field-pp-primarybreed'})
secondary_breed = pet[0].find('div', attrs = {'class': 'views-field views-field-field-pp-secondarybreed'})
age = pet[0].find('div', attrs = {'class': 'views-field views-field-field-pp-age'})

In [7]:
# We can call `get_text()` on those objects to print them nicely.
print({
    "name": name.get_text(strip = True), 
    "primary_breed": primary_breed.get_text(strip = True), 
    "secondary_breed": secondary_breed.get_text(strip = True),
    "age": age.get_text(strip=True)
})


{'name': 'Bear', 'primary_breed': 'Pointer', 'secondary_breed': 'Mix', 'age': 'Age:0 years 2 months'}
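
One caveat: find returns None when no matching tag exists, so calling get_text directly can raise an AttributeError if a page's layout differs slightly from what you expect. A small, hypothetical helper guards against that:

In [ ]:
def safe_text(tag):
    """Return the tag's text, or an empty string if the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else ""

print(safe_text(pet[0].find('div', attrs={'class': 'views-field views-field-field-pp-animalname'})))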

Now, to get at the HTML object for each pet, we could find each one's CSS selector. Or we can exploit the fact that every pet listing shares a similar HTML structure: each lives in a div tag whose class attribute contains the string views-row. We'll find all of those div tags and print out their attributes like we just did.


In [8]:
all_pets = soup.find_all('div', {'class': 'views-row'})

In [9]:
for pet in all_pets:
    name = pet.find('div', {'class': 'views-field views-field-field-pp-animalname'}).get_text(strip=True)
    primary_breed = pet.find('div', {'class': 'views-field views-field-field-pp-primarybreed'}).get_text(strip=True)
    secondary_breed = pet.find('div', {'class': 'views-field views-field-field-pp-secondarybreed'}).get_text(strip=True)
    age = pet.find('div', {'class': 'views-field views-field-field-pp-age'}).get_text(strip=True)
    print([name, primary_breed, secondary_breed, age])


['Bear', 'Pointer', 'Mix', 'Age:0 years 2 months']
['Wally', 'Pointer', 'Mix', 'Age:0 years 2 months']
['Honey', 'Australian Cattle Dog', 'Mix', 'Age:2 years 6 months']
['Mr. Biggs', 'Boxer', 'Mix', 'Age:0 years 10 months']
['Madeline', 'Poodle, Miniature', 'Mix', 'Age:8 years 6 months']
['Rocky Dog', 'Chihuahua, Short Coat', 'Mix', 'Age:3 years 0 months']
['Dr. Octopus', 'Boxer', 'Mix', 'Age:0 years 2 months']
['Roxy', 'German Shepherd', 'Mix', 'Age:0 years 8 months']
['Lola', 'Belgian Malinois', '', 'Age:1 year 0 months']
['Sapheria', 'Pomeranian', 'Mix', 'Age:12 years 5 months']
['Precious', 'Papillon', 'Mix', 'Age:8 years 0 months']
['Bonsai', 'Terrier, Cairn', 'Poodle, Miniature', 'Age:3 years 5 months']
['Pongo', 'Chihuahua, Short Coat', 'Mix', 'Age:5 years 2 months']
['Freddie', 'Hound', 'Mix', 'Age:0 years 4 months']
['Nikki', 'Hound', 'Mix', 'Age:0 years 4 months']
['Lucy', 'Hound', 'Mix', 'Age:0 years 4 months']
['Chandler', 'Terrier, American Pit Bull', 'Mix', 'Age:2 years 0 months']
['LuLu', 'Retriever, Labrador', 'Border Collie', 'Age:3 years 7 months']
['Maggie', 'Australian Cattle Dog', 'Mix', 'Age:2 years 0 months']
['Duke', 'Border Collie', '', 'Age:1 year 0 months']
['Boss', 'Terrier, Boston', 'Mix', 'Age:0 years 3 months']
['Toto', 'Terrier, Cairn', '', 'Age:0 years 10 months']
['Charlie', 'Retriever, Labrador', 'Mix', 'Age:0 years 5 months']
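
To finish step 4 of the typical scraping program (saving the data in a convenient format), we might gather these fields into a pandas DataFrame and write it to CSV. A sketch that loops over all three listing pages in pet_pages, assuming the cat and small-animal pages share the dog page's HTML structure (the output filename is just a placeholder):

In [ ]:
records = []
for url in pet_pages:
    page_soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    for p in page_soup.find_all('div', {'class': 'views-row'}):
        records.append({
            'name': p.find('div', {'class': 'views-field views-field-field-pp-animalname'}).get_text(strip=True),
            'primary_breed': p.find('div', {'class': 'views-field views-field-field-pp-primarybreed'}).get_text(strip=True),
            'secondary_breed': p.find('div', {'class': 'views-field views-field-field-pp-secondarybreed'}).get_text(strip=True),
            'age': p.find('div', {'class': 'views-field views-field-field-pp-age'}).get_text(strip=True),
        })

pets_df = pd.DataFrame(records)
# pets_df.to_csv("boulder_humane_pets.csv", index=False)  # placeholder filename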

This may seem like a fairly silly example of web scraping, but one could imagine several research questions using this data. For example, if we collected this data over time (e.g. using the Wayback Machine), could we identify which features of pets -- names, breeds, ages -- make them more likely to be adopted? Are certain names more common for certain breeds? Or maybe your research question is something even wackier.

Aside: Read Tables from Webpages

Pandas has a really neat function, read_html, which downloads an HTML table directly from a webpage and loads it into a DataFrame.


In [10]:
table = pd.read_html("https://en.wikipedia.org/wiki/List_of_sandwiches", header=0)[0]
#table.to_csv("filenamehere.csv") # Write table to CSV

In [11]:
table.head(20)


Out[11]:
Name Image Origin Description
0 Bacon NaN United Kingdom Often eaten with ketchup or brown sauce
1 Bacon, egg and cheese NaN United States Breakfast sandwich, usually with fried or scra...
2 Bagel toast NaN Israel Pressed, toasted bagel filled with vegetables ...
3 Baked bean NaN United States Canned baked beans on white or brown bread, so...
4 Bánh mì[4] NaN Vietnam Filling is typically meat, but can contain a w...
5 Barbecue[5][6][7] NaN United States Served on a bun, with chopped, sliced, or shre...
6 Barros Jarpa NaN Chile Ham and cheese, usually mantecoso, which is si...
7 Barros Luco NaN Chile Beef (usually thin-cut steak) and cheese
8 Bauru NaN Brazil Melted cheese, roast beef, tomato, and pickled...
9 Beef on weck NaN United States(Buffalo, New York) Roast beef on a Kummelweck roll
10 Beirute NaN Brazil Melted cheese, sliced fresh tomatoes with oreg...
11 BLT NaN United States Named for its ingredients: bacon, lettuce, and...
12 Bocadillo de calamares NaN Spain Baguette bread filled with fried squid
13 Bologna NaN United States, Canada Pre-sliced and sometimes fried bologna sausage...
14 Boloney (Bologna) salad sandwich NaN NE Pennsylvania A mixture of bologna sausage and sweet gherkin...
15 Bosna NaN Austria Usually grilled on white bread, containing a b...
16 Bratwurst Sandwich NaN Germany The Bratwurst sandwich is a popular street foo...
17 Breakfast roll NaN United Kingdom and Ireland Convenience dish on a variety of bread rolls, ...
18 Breakfast NaN United States Typically a scrambled or fried egg, cheese, an...
19 British Rail NaN United Kingdom Reference to the poor quality of catering on t...
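
Since read_html hands back an ordinary DataFrame, you can summarize it right away; for instance, counting how many sandwiches the table attributes to each country of origin:

In [ ]:
print(table['Origin'].value_counts().head(10))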

Requesting a Webpage with Selenium

Sometimes our interactions with webpages involve rendering Javascript. For example, think of visiting a webpage with a search box, typing in a query, pressing search, and viewing the result. Or visiting a webpage that requires a login, or clicking through pages in a list. To handle pages like these we'll use a package in Python called Selenium.

Installing Selenium can be a little tricky; follow the official installation directions as best you can. In addition to the Python package, you'll need a driver for whichever browser you plan to automate, i.e. one of: Safari (safaridriver), Chrome (chromedriver), or Firefox (geckodriver).

First a fairly simple example: let's visit xkcd and click through the comics.


In [12]:
driver = selenium.webdriver.Safari() # This command opens a window in Safari
# driver = selenium.webdriver.Chrome(executable_path = "<path to chromedriver>") # This command opens a window in Chrome
# driver = selenium.webdriver.Firefox(executable_path = "<path to geckodriver>") # This command opens a window in Firefox

# Get the xkcd website
driver.get("https://xkcd.com/")

In [13]:
# Let's find the 'Random' button
element = driver.find_element_by_xpath('//*[@id="middleContainer"]/ul[1]/li[3]/a')
element.click()

In [14]:
# Find an attribute of this page - the title of the comic.
element = driver.find_element_by_xpath('//*[@id="comic"]/img')
element.get_attribute("title")


Out[14]:
'They always try to explain that they\'re called \'solar physicists\', but the reporters interrupt with "NEVER MIND THAT, TELL US WHAT\'S WRONG WITH THE SUN!"'

In [15]:
# Continue clicking through the comics
driver.find_element_by_xpath('//*[@id="middleContainer"]/ul[1]/li[3]/a').click()

In [16]:
driver.quit() # Always remember to close your browser!
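
A note on versions: the find_element_by_xpath style used in this notebook works in Selenium 3 but was removed in Selenium 4. If you're running a newer Selenium, the equivalent call passes the locator strategy explicitly (a sketch, assuming a driver is open as above):

In [ ]:
from selenium.webdriver.common.by import By

# Selenium 4 style: same 'Random' button as above, located via an explicit By strategy.
element = driver.find_element(By.XPATH, '//*[@id="middleContainer"]/ul[1]/li[3]/a')
element.click()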

We'll now walk through how we can use Selenium to navigate a site called "Box Office Mojo", which tracks box-office revenues.


In [17]:
driver = selenium.webdriver.Safari() # This command opens a window in Safari
# driver = selenium.webdriver.Chrome(executable_path = "<path to chromedriver>") # This command opens a window in Chrome
# driver = selenium.webdriver.Firefox(executable_path = "<path to geckodriver>") # This command opens a window in Firefox

driver.get('https://www.boxofficemojo.com')

Let's say I wanted to know which movie has been more lucrative: 'Wonder Woman', 'Black Panther', or 'Avengers: Infinity War'. I could start by typing 'Avengers: Infinity War' into the search bar on the upper left.


In [18]:
# Type in the search bar, and click 'Search'
driver.find_element_by_xpath('//*[@id="leftnav"]/li[2]/form/input[1]').send_keys('Avengers: Infinity War')
driver.find_element_by_xpath('//*[@id="leftnav"]/li[2]/form/input[2]').click()
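
Search results can take a moment to load, so rather than parsing immediately it's safer to wait explicitly for the results table to appear. A sketch using Selenium's WebDriverWait (the ten-second timeout is an arbitrary choice):

In [ ]:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the results table to be present before we parse it.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, '//*[@id="body"]/table[2]'))
)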

Now I can grab the returned table and parse it with pandas's read_html.


In [19]:
# This is what the table looks like
table = driver.find_element_by_xpath('//*[@id="body"]/table[2]')
# table.get_attribute('innerHTML').strip()

In [20]:
pd.read_html(table.get_attribute('innerHTML').strip(), header=0)[2]


Out[20]:
Movie Title (click title to view) Studio Lifetime Gross / Theaters Opening / Theaters Release Links Unnamed: 6 Unnamed: 7
0 Avengers: Infinity War BV $678,807,703 4474 $257,698,183 4474 4/27/2018 NaN

In [21]:
# Find the link to more details about the Avengers movie and click it
driver.find_element_by_xpath('//*[@id="body"]/table[2]/tbody/tr/td/table[2]/tbody/tr[2]/td[1]/b/font/a').click()

Now, we can do the same for the remaining movies: 'Wonder Woman', and 'Black Panther' ...


In [22]:
driver.quit() # Always remember to close your browser!

Again, this might seem like a fairly simple example of web scraping, but there are some fun data science questions you could answer with this new technique. For example: for how many movie franchises were the sequels more successful than the originals? Furious 7, for instance (see 'Adjusted for Ticket Price Inflation'), was the most lucrative Fast and the Furious movie.

Key Takeaways

These two examples hopefully show you how fun web scraping can be, but, as I hinted at earlier, in some cases web scraping is illegal, and in others it is simply tedious. So when should you use this new tool? Here are some pointers:

  1. When it is not illegal.
  2. Ideally, when webpages you are trying to scrape share similar HTML structure.
  3. In cases where collecting the data by hand would be prohibitively expensive, in time or money.
  4. When the website you are trying to scrape has not made a publicly accessible API or dataset.

Activity

In this notebook, we learned about how to scrape data from the web using Python's packages requests, BeautifulSoup, and Selenium. Some of you have already had experience with web scraping, but for others, this may have been your first time collecting digital trace data.

This group exercise is designed to strike a balance between practicing rudimentary skills (for those of you with little or no experience in this area) and trying cutting-edge techniques (for those of you with extensive expertise in this area). As an added bonus, this exercise challenges you not only to practice your coding skills, but also to think about how to ask questions that contribute new knowledge to sociological theory.

  1. First, for five minutes, independently brainstorm one or two research questions that you believe can be answered using online data sources and web scraping.
  2. Divide yourselves into groups of three or four. Try to join a group with people you haven't worked with yet.
  3. For 10 minutes, work together to identify a research question based on one of the data sources proposed by your group members.
  4. Evaluate the strengths and weaknesses of the data you plan to collect.
  5. Outline a hybrid research design (e.g. an app or a bot) that could be used to address the weaknesses of the data you collected, or otherwise improve your ability to answer the research question.
  6. (If you have time, write code to collect data from each unit of analysis in your sample. See the code below for help.)

There is only one requirement: the group member with the least amount of experience coding should be responsible for typing the code into a computer. After 45 minutes, we will share our work with the group. Let us know if you'd like to present your group's potential project. Remember that these daily exercises are a way for you to explore new possible topics and to get to know each other better.

List of Online Open Source Data & Websites


In [ ]: